Protein shape sampled by ion mobility mass spectrometry consistently improves protein structure prediction

Ion mobility (IM) mass spectrometry provides structural information about protein shape and size in the form of an orientationally-averaged collision cross-section (CCSIM). While IM data have been used with various computational methods, they have not yet been utilized to predict monomeric protein structure from sequence. Here, we show that IM data can significantly improve protein structure determination using the modelling suite Rosetta. We develop the Rosetta Projection Approximation using Rough Circular Shapes (PARCS) algorithm that allows for fast and accurate prediction of CCSIM from structure. Following successful testing of the PARCS algorithm, we use an integrative modelling approach to utilize IM data for protein structure prediction. Additionally, we propose a confidence metric that identifies near native models in the absence of a known structure. The results of this study demonstrate the ability of IM data to consistently improve protein structure prediction.

All structures in both the ideal and experimental datasets were also predicted with AlphaFold2 12 (AF) and RoseTTAFold 13 (RF) using default settings, with and without templates. By default, both methods predict structures with the aid of templates. For AF predictions with templates, the maximum template date was set to one day before the deposition date of the PDB entry (as per the instructions at https://github.com/deepmind/alphafold). This ensured that the benchmark structure itself was not used as a template. However, other structures (deposited before the target) were used as templates. The same was ensured for RF by removing the benchmark PDBs from the template database. AF offers no direct way to predict structures without templates, so we employed two different methods to achieve this. The first method was to set the maximum template date to 1900-01-01. With this setting AF still searched the database for templates, but because the PDB was created in 1971, a cutoff this early effectively ensured that no templates were found. The second method was to make slight changes to the AF source code, as outlined in Supplementary Note 3 of this document. In the data shown, we used the first method to remove all templates for AF predictions; Supplementary Data 14 shows that the RMSD and TM-Score from the two options were practically identical. For RF, protein structures were predicted without templates by omitting the option "-atab" when running the prediction.
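The two template-date settings described above can be sketched as a small helper. This is an illustrative sketch only: the function name and the example deposition date are hypothetical, and the resulting string is assumed to be passed to AlphaFold's --max_template_date flag.

```python
from datetime import date, timedelta

# Cutoff used to suppress templates: it predates the PDB (created in 1971).
NO_TEMPLATE_DATE = date(1900, 1, 1)


def max_template_date(deposition_date: date, use_templates: bool = True) -> str:
    """Return the template cutoff date as an ISO string (YYYY-MM-DD).

    With templates: one day before the benchmark structure was deposited,
    so the benchmark itself cannot be picked up as a template.
    Without templates: a date so early that no PDB entry can match.
    """
    if not use_templates:
        return NO_TEMPLATE_DATE.isoformat()
    return (deposition_date - timedelta(days=1)).isoformat()


# Hypothetical deposition date, for illustration only.
deposited = date(2015, 3, 1)
print("--max_template_date=" + max_template_date(deposited))
print("--max_template_date=" + max_template_date(deposited, use_templates=False))
```

Keeping the cutoff logic in one function makes it harder to accidentally run a "with templates" prediction against a date that still includes the benchmark entry.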
For the experimental dataset, the native structures (ground truth PDBs) used to compare to the predicted models were chosen such that the sequence and experimental conditions most closely resembled the IM experiment (Supplementary Data 2).
Model quality of all best scoring models from the IM score function was further assessed with Voronota 14,15 and P3CMQA 16 (in their default settings) for both the ideal and experimental datasets. To better compare the IM Confidence score to Voronota and P3CMQA, the IM Confidence score was scaled by dividing it by the most negative confidence score from each dataset (ideal and experimental).
Random amounts of noise were generated in CCSIdeal. To create noise for the ideal dataset, a noise percentage was randomly selected anywhere from −x% to +x%, where x is the chosen noise level, and added to the CCSIdeal of proteins in the ideal dataset as shown in Supplementary Equation 1. These CCSIdeal values with noise were then used to score structures in the ideal dataset.
Prediction results with noisy collision cross sections simulated from the ideal dataset (CCSIdeal). Random noise at 15% (orange) and 30% (green) was added to CCSIdeal (no noise, blue) for all generated structures (600000 structures) in the ideal dataset. (a) In comparison to (i) 0% noise, the IM score vs TM-Score distribution with (ii) 15% and (iii) 30% noise showed no significant change. (b) In these violin distributions (n = 600000 biologically independent samples over 3 independent random noise simulations; axes: TM-Score of predicted structures vs % noise simulated for ideal IM data), no significant change in the global folds was observed for the best scoring models when random noise was introduced at 15% and 30% as compared to those with 0% noise. The mean and the standard error of the mean of the TM-Score for the distributions in (b) are 0.857 ± 0.024, 0.841 ± 0.025, and 0.839 ± 0.026 for 0%, 15%, and 30% noise, respectively. The white dots represent the median in each violin distribution. The black bar in the center of the distribution is the interquartile range (IQR). The black stretched line extends from the "first quartile − 1.5 IQR" to the "third quartile + 1.5 IQR". Values beyond this range are considered outliers. Source data are provided as a Source Data file.
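The confidence-score scaling and the noise simulation described above can be illustrated with a short sketch. Supplementary Equation 1 is not reproduced in this section, so the multiplicative form used here (CCSIdeal × (1 + p/100), with p drawn uniformly between −x and +x) is an assumption, and both function names are hypothetical.

```python
import random


def add_ccs_noise(ccs_ideal: float, noise_level_pct: float, rng: random.Random) -> float:
    """Perturb an ideal CCS by a random percentage in [-noise_level_pct, +noise_level_pct].

    ASSUMPTION: noise is applied as a percentage of CCS_ideal; the exact
    form of Supplementary Equation 1 may differ.
    """
    pct = rng.uniform(-noise_level_pct, noise_level_pct)
    return ccs_ideal * (1.0 + pct / 100.0)


def scale_confidence(scores: list[float]) -> list[float]:
    """Scale confidence scores by the most negative score in the dataset.

    After scaling, the most negative input maps to 1.0.
    """
    most_negative = min(scores)
    if most_negative >= 0:
        raise ValueError("expected at least one negative confidence score")
    return [s / most_negative for s in scores]


rng = random.Random(42)  # fixed seed so a noise simulation can be repeated
noisy = [add_ccs_noise(1000.0, 15.0, rng) for _ in range(3)]
print(noisy)
print(scale_confidence([-2.0, -1.0, -4.0]))
```

Passing an explicit random.Random instance, rather than using the module-level generator, makes each of several independent noise simulations reproducible from its own seed.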
Supplementary Figure 6. Improved structure prediction with ion mobility (IM) data for sequences with poor or no templates. Consistent improvement in model selection was observed when using the IM score function for the subset of 54 proteins where comparative modelling (CM, with non-perfect templates) and ab initio (template-free) protocols were utilized. The predicted models from the IM score function were compared to those of the (a) radius of gyration (RG) and (b) Rosetta (RS) score functions in terms of their respective (i) root mean square deviation (RMSD) and (ii) template modelling score (TM-Score). For both (a) and (b) the subsets of models from the ideal and experimental datasets are shown in blue and red, respectively. Source data are provided as a Source Data file.
Supplementary Figure 7. Illustration of the ion mobility score term (IMScore_Term). IMScore_Term is a fade function where LB and UB are the lower and upper bound cutoffs, set at 10 Å² and 100 Å², respectively. This term penalizes structures based on the absolute difference between the experimental collision cross section (CCS) and the structure's predicted CCS. Source data are provided as a Source Data file.
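To make the shape of such a term concrete, the following is a minimal sketch of a fade-style penalty with the LB and UB cutoffs given in the caption. The exact functional form and maximum penalty of IMScore_Term are not stated in this section, so the cubic interpolation between the cutoffs and the max_penalty parameter are assumptions for illustration, not the Rosetta implementation.

```python
def fade_penalty(ccs_pred: float, ccs_exp: float,
                 lb: float = 10.0, ub: float = 100.0,
                 max_penalty: float = 1.0) -> float:
    """Penalty that fades in smoothly as |CCS_pred - CCS_exp| grows from LB to UB.

    ASSUMPTIONS: zero penalty at or below LB, a constant max_penalty at or
    above UB, and a cubic (smoothstep) ramp in between. Units are Å².
    """
    delta = abs(ccs_pred - ccs_exp)
    if delta <= lb:
        return 0.0
    if delta >= ub:
        return max_penalty
    t = (delta - lb) / (ub - lb)  # 0 at LB, 1 at UB
    return max_penalty * (3.0 * t * t - 2.0 * t ** 3)


for predicted in (1005.0, 1055.0, 1200.0):
    print(predicted, fade_penalty(predicted, 1000.0))
```

A smooth ramp of this kind avoids a discontinuous jump in score at the cutoffs, which is the usual motivation for a fade function over a hard threshold.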

Supplementary Note 1: General usage of PARCS application
A structure is required to run this application. To use PARCS to predict the CCS of given structure(s), users need to specify the full path to the executable of the PARCS application (<path/to/Rosetta>/main/source/bin/parcs_ccs_calc.default.<os><compiler>release). Users also need to provide the full path to the Rosetta database with the flag -database. Next, the structure of the protein for which CCS is to be predicted is specified with -in:file:s (or -in:file:l for a list of structures) in a format readable by Rosetta. Users may choose to specify the number of rotations with -ccs_nrots (default is set to 300). The probe radius is set to 1.0 Å by default to predict CCS in helium buffer gas. It can instead be set to 1.81 Å with the option -ccs_prad to predict CCS in nitrogen buffer gas. By default, the application saves the output, containing two pieces of information (the name of the structure file and the CCS value in Å²), to a file named 'CCS_default.out'. However, users can define the output file name with the flag -out:file:o. General usage of the command-line options to run a CCS calculation on a single structure is shown below, where variables that need to be specified by users are shown in angle brackets (< >) and are defined after the command:
<path/to/Rosetta>/main/source/bin/parcs_ccs_calc.default.<os><compiler>release -database <path/to/Rosetta>/main/database -in:file:s <structure> -ccs_nrots <number_of_rotations> -ccs_prad <probe_radius_in_angstroms> -out:file:o <output_file_name>
• path/to/Rosetta - Users' path to Rosetta.
• os - Name of operating system (linux, mac, etc.).
• compiler - Name of C++ compiler (gcc, clang, etc.).
• structure - Name of structure for which CCS calculation is performed.
• number_of_rotations - Number of random rotations for CCS calculations. Must be an integer. Default is set to 300.
• probe_radius_in_angstroms - Radius of the buffer gas probe. Default is set to 1.0 Å for helium gas. For nitrogen gas please use 1.81 Å.
• output_file_name - User-defined output file name. Default is set to "CCS_default.out".
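For scripted use, the command line above can be assembled programmatically. The sketch below only builds the argument list from the flags described in this note. The helper name build_parcs_command and the example paths are hypothetical, and the sketch does not check that the executable exists or run it.

```python
import shlex
from pathlib import PurePosixPath


def build_parcs_command(rosetta_root: str, os_name: str, compiler: str,
                        structure: str, n_rots: int = 300,
                        probe_radius: float = 1.0,
                        output_file: str = "CCS_default.out") -> list[str]:
    """Assemble the PARCS command line as a list of arguments.

    Defaults mirror those stated in the note: 300 rotations, a 1.0 Å probe
    (helium; use 1.81 Å for nitrogen), and output to 'CCS_default.out'.
    """
    if not isinstance(n_rots, int) or n_rots <= 0:
        raise ValueError("number of rotations must be a positive integer")
    root = PurePosixPath(rosetta_root)
    exe = root / "main" / "source" / "bin" / f"parcs_ccs_calc.default.{os_name}{compiler}release"
    return [
        str(exe),
        "-database", str(root / "main" / "database"),
        "-in:file:s", structure,
        "-ccs_nrots", str(n_rots),
        "-ccs_prad", str(probe_radius),
        "-out:file:o", output_file,
    ]


# Hypothetical installation path and input file, for illustration only.
cmd = build_parcs_command("/opt/rosetta", "linux", "gcc", "model.pdb",
                          probe_radius=1.81, output_file="model_ccs_n2.out")
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the paths point at a real Rosetta installation. Building a list rather than a single string avoids shell-quoting problems with unusual file names.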

Example usage of PARCS application to predict CCS of ubiquitin (1UBQ)
In this example the CCS of ubiquitin, with a crystal structure available in the PDB (1UBQ), is predicted with PARCS. Note: In this tutorial we will assume that our operating system is Linux and our compiler is gcc.
1. Create a new directory for input and output files and enter this directory:
> mkdir calculate_ccs_for_known_structure && cd calculate_ccs_for_known_structure
2. Download the crystal structure of ubiquitin from https://www.rcsb.org/structure/1UBQ into the calculate_ccs_for_known_structure directory.
3. PDB files often contain other useful information, such as water molecules, non-standard amino acids, additional molecules pertaining to experimental conditions, etc. However, this extra information may cause Rosetta to fail if the input structure file is not properly prepared. Fortunately, Rosetta offers a python script (clean_pdb.py) to work around this issue. Use this python script on the PDB file and specify the file name and chain of interest. For ubiquitin this is chain A, and the command is:
> python <path/to/Rosetta>/tools/protein_tools/scripts/clean_pdb.py 1UBQ.pdb A
Note: The script clean_pdb.py should produce two files, 1UBQ_A.fasta and 1UBQ_A.pdb. For this tutorial only the 1UBQ_A.pdb file is utilized.
4. Predict the CCS with 250 random rotations and a probe radius of 1.0 Å, and save the output file as '1ubq_predicted_ccs.txt', with the following command.